Abstract

Language models for speech recognition tend
to concentrate solely on recognizing the words
that were spoken. In this paper, we redefine
the speech recognition problem so that
its goal is to find both the best sequence of
words and their syntactic role (part-of-speech)
in the utterance. This is a necessary first
step towards tightening the interaction between
speech recognition and natural language
understanding.

1 INTRODUCTION

For recognizing spontaneous speech, the acoustic signal
is too weak to narrow down the number of word candidates.
Hence, speech recognizers employ a language
model that prunes out acoustic alternatives by taking
into account the previous words that were recognized.
In doing this, the speech recognition problem is viewed
as finding the most likely word sequence $\hat{W}$ given the
acoustic signal $A$ (Jelinek, 1985).

$$
\begin{aligned}
\hat{W} &= \arg\max_{W} \Pr(W \mid A) \\
        &= \arg\max_{W} \frac{\Pr(A \mid W)\,\Pr(W)}{\Pr(A)} \\
        &= \arg\max_{W} \Pr(A \mid W)\,\Pr(W)
\end{aligned}
$$

The last line drops $\Pr(A)$, which does not depend on $W$,
and involves two probabilities that need to be
estimated: the first due to the acoustic model $\Pr(A \mid W)$
and the second due to the language model $\Pr(W)$. The
probability due to the language model can be expressed
as the following, where we rewrite the sequence $W$
explicitly as the sequence of $N$ words $W_{1,N}$.

$$
\Pr(W_{1,N}) = \prod_{i=1}^{N} \Pr(W_i \mid W_{1,i-1})
$$

To estimate the probability distribution, a training corpus
is typically used from which the probabilities can be
estimated by relative frequencies. Due to sparseness of
data, one must define equivalence classes amongst the
contexts $W_{1,i-1}$, which can be done by limiting the
context to an n-gram language model (Jelinek, 1985) and
also by grouping words into word classes (Brown et al.,
1992).
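
As a minimal sketch of estimation by relative frequencies, the
following assumes a bigram context (n = 2) and a toy tokenized
corpus. The function name `bigram_relative_frequencies`, the
start marker "<s>", and the toy sentences are illustrative
assumptions and do not come from the paper.

```python
from collections import Counter, defaultdict

def bigram_relative_frequencies(sentences):
    """Estimate Pr(w_i | w_{i-1}) by relative frequency.

    `sentences` is a list of token lists. The hypothetical "<s>"
    marker stands in for the start-of-utterance context.
    """
    context_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        prev = "<s>"
        for word in tokens:
            context_counts[prev] += 1
            pair_counts[(prev, word)] += 1
            prev = word
    probs = defaultdict(dict)
    for (prev, word), count in pair_counts.items():
        # Relative frequency: count(prev, word) / count(prev)
        probs[prev][word] = count / context_counts[prev]
    return probs

toy_corpus = [
    ["move", "the", "engine"],
    ["move", "the", "boxcar"],
    ["take", "the", "engine"],
]
probs = bigram_relative_frequencies(toy_corpus)
print(probs["the"])
```

Any context that never occurs in the training data receives no
estimate at all here, which is the sparseness problem that
equivalence classes are meant to address.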

Several attempts have been made to incorporate shallow
syntactic information to give better equivalence
classes, where the shallow syntactic information is
expressed as part-of-speech (POS) tags (e.g. (Jelinek,
1985), (Niesler and Woodland, 1996)). A POS tag indicates
the syntactic role that a particular word is playing in
the utterance, e.g. whether it is a noun or a verb. The
approach is to use the POS tags of the prior few words to
define the equivalence classes. This is done by summing
over all POS possibilities as shown below.

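$$
\begin{aligned}
\Pr(W_i \mid W_{1,i-1})
  &= \sum_{P_{1,i}} \Pr(W_i \mid P_{1,i} W_{1,i-1})\,\Pr(P_{1,i} \mid W_{1,i-1}) \\
  &= \sum_{P_{1,i}} \Pr(W_i \mid P_{1,i} W_{1,i-1})\,\Pr(P_i \mid P_{1,i-1} W_{1,i-1})\,\Pr(P_{1,i-1} \mid W_{1,i-1})
\end{aligned}
$$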

Furthermore, the following two assumptions are made to 
simplify the context. 



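$$
\begin{aligned}
\Pr(W_i \mid P_{1,i} W_{1,i-1}) &\approx \Pr(W_i \mid P_i) \\
\Pr(P_i \mid P_{1,i-1} W_{1,i-1}) &\approx \Pr(P_i \mid P_{1,i-1})
\end{aligned}
$$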
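
Substituting the two assumptions into the summation above gives
the simplified model. This is a direct algebraic consequence of
the preceding equations, written out here only to make the
resulting form explicit.

```latex
\Pr(W_i \mid W_{1,i-1}) \approx
  \sum_{P_{1,i}} \Pr(W_i \mid P_i)\,
                 \Pr(P_i \mid P_{1,i-1})\,
                 \Pr(P_{1,i-1} \mid W_{1,i-1})
```

The word $W_i$ now depends on the context only through its own
POS tag $P_i$, and the tag $P_i$ depends only on the earlier tags.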



However, this approach does not lead to an improvement
in the performance of the speech recognizer. For
instance, Srinivas (Srinivas, 1996) reports that such a
model results in a 24.5% increase in perplexity over
a word-based model on the Wall Street Journal, and
Niesler and Woodland (Niesler and Woodland, 1997)
report an 11.3% increase (but a 22-fold decrease in the
number of parameters of such a model). Only by
interpolating in a word-based model is an improvement seen
(Jelinek, 1985).



A more serious problem with the above approach is that
in a spoken dialogue system, speech recognition is only
the first step in understanding a speaker's contribution.
One also needs to determine the syntactic structure of the
words involved, their semantic meaning, and the speaker's
intention in making the utterance. This information is
needed to help the speech recognizer constrain the
alternative hypotheses. Hence, we need a tighter coupling
between speech recognition and the rest of the
interpretation process.

2 REDEFINING THE PROBLEM 

As a starting point, we re-examine the approach of us- 
ing POS tags in the speech recognition process. Rather 
than view POS tags as intermediate objects solely to 
find the best word assignment, we redefine the goal of 
the speech recognition process so that it finds the best 
word sequence and the best POS interpretation given the 
acoustic signal. 

$$
\begin{aligned}
\hat{W}\hat{P} &= \arg\max_{WP} \Pr(WP \mid A) \\
               &= \arg\max_{WP} \Pr(A \mid WP)\,\Pr(WP)
\end{aligned}
$$

The first term $\Pr(A \mid WP)$ is the acoustic model, which
traditionally excludes the category assignment. The second
term $\Pr(WP)$ is the POS-based language model.
Just as before, we rewrite $\Pr(WP)$ as
a product of probabilities of the word and POS tag given
the previous context.

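$$
\begin{aligned}
\Pr(W_{1,N} P_{1,N})
  &= \prod_{i=1}^{N} \Pr(W_i P_i \mid W_{1,i-1} P_{1,i-1}) \\
  &= \prod_{i=1}^{N} \Pr(W_i \mid W_{1,i-1} P_{1,i})\,\Pr(P_i \mid W_{1,i-1} P_{1,i-1})
\end{aligned}
$$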

The final probability distributions are similar to those
used for POS tagging of written text (Charniak et al.,
1993; Church, 1988; DeRose, 1988). However, these
approaches simplify the probability distributions as is done
by previous attempts to use POS tags in speech recognition
language models.¹ As we will show in Section 4.1,
such simplifications lead to poorer language models.
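
The two-factor decomposition can be sketched as a scoring
routine for a tagged word sequence. This is only an illustration
under stated assumptions: `word_prob` and `pos_prob` are
hypothetical callables standing in for trained estimators, and
the uniform toy distributions are invented for the example.

```python
import math

def joint_log2_prob(words, tags, word_prob, pos_prob):
    """Return log2 Pr(W_{1,N} P_{1,N}) using the two-factor decomposition.

    word_prob(w, prev_words, tags_so_far) plays the role of
        Pr(W_i | W_{1,i-1} P_{1,i});
    pos_prob(p, prev_words, prev_tags) plays the role of
        Pr(P_i | W_{1,i-1} P_{1,i-1}).
    Both callables are hypothetical stand-ins for trained estimators.
    """
    total = 0.0
    for i, (w, p) in enumerate(zip(words, tags)):
        prev_words, prev_tags = words[:i], tags[:i]
        total += math.log2(pos_prob(p, prev_words, prev_tags))
        # The word is predicted with its own tag P_i already in the context.
        total += math.log2(word_prob(w, prev_words, tags[: i + 1]))
    return total

# Toy estimators: every tag has probability 1/4, every word 1/8.
uniform_pos = lambda p, prev_words, prev_tags: 0.25
uniform_word = lambda w, prev_words, tags_so_far: 0.125

score = joint_log2_prob(["move", "the", "engine"], ["VB", "DT", "NN"],
                        uniform_word, uniform_pos)
print(score)  # 3 * (log2(1/4) + log2(1/8)) = -15.0
```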



3 ESTIMATING THE PROBABILITIES 

The probability distributions that we now need to
estimate are more complicated than the traditional ones.
Our approach is to use the decision tree learning
algorithm (Bahl et al., 1989; Black et al., 1992; Breiman et
al., 1984), which uses information theoretic measures to
construct equivalence classes of the context in order to
cope with sparseness of data. The decision tree algorithm
starts with all of the training data in a single leaf node.
For each leaf node, it looks for the question to ask of the
context such that splitting the node into two leaf nodes
results in the biggest decrease in impurity, where the
impurity measures how well each leaf predicts the events
in the node. Heldout data is used to decide when to stop
growing the tree: a split is rejected if it does not
result in a decrease in impurity with respect to the heldout
data. After the tree is grown, the heldout dataset is used
to smooth the probabilities of each node with its parent
(Bahl et al., 1989).

¹ A notable exception is the work of Black et al. (Black et
al., 1992), who use a decision tree to learn the probability
distributions for POS tagging.
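
The split selection step described above can be sketched as
follows. The sketch assumes entropy as the impurity measure and
measures heldout impurity directly on the heldout events; the
paper does not spell out either choice, so both are assumptions.
The function names and the toy data are illustrative only.

```python
import math
from collections import Counter

def entropy(events):
    """Entropy in bits of the empirical distribution over `events`."""
    total = len(events)
    if total == 0:
        return 0.0
    counts = Counter(events)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def split_impurity(data, question):
    """Size-weighted entropy after splitting (context, event) pairs."""
    total = len(data)
    if total == 0:
        return 0.0
    yes = [event for context, event in data if question(context)]
    no = [event for context, event in data if not question(context)]
    return (len(yes) * entropy(yes) + len(no) * entropy(no)) / total

def best_split(train, heldout, questions):
    """Pick the question with the largest impurity decrease on `train`;
    reject it (return None) if it does not also decrease impurity on
    `heldout`."""
    base_train = entropy([event for _, event in train])
    base_held = entropy([event for _, event in heldout])
    name, question = min(questions.items(),
                         key=lambda item: split_impurity(train, item[1]))
    if split_impurity(train, question) >= base_train:
        return None
    if split_impurity(heldout, question) >= base_held:
        return None
    return name

# Toy data: the event is the current tag, the context holds the previous tag.
train = [({"prev": "DT"}, "NN")] * 4 + [({"prev": "VB"}, "DT")] * 4
heldout = [({"prev": "DT"}, "NN"), ({"prev": "VB"}, "DT")]
questions = {
    "prev_is_DT": lambda c: c["prev"] == "DT",
    "prev_is_NN": lambda c: c["prev"] == "NN",
}
print(best_split(train, heldout, questions))
```

A full tree grower would apply `best_split` recursively to each
new leaf and then smooth each node with its parent, which this
sketch leaves out.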



3.1 Word and POS Classification Trees 

To allow the decision tree to ask about the words and 
POS tags in the context, we cluster the words and POS 
tags using the algorithm of Brown et al. (Brown et al., 
1992) into a binary classification tree. The algorithm 
starts with each word (or POS tag) in a separate class, and
successively merges classes that result in the smallest loss
in mutual information in terms of the co-occurrences of
these classes. By keeping track of the order that classes
were merged, we can construct a hierarchical classification
of the words. Figure 1 shows a classification tree
that we grew for the POS tags. The binary classification
tree gives an implicit binary encoding for each word and 
POS tag, which we show after each POS tag in the figure. 
The decision tree algorithm can then ask questions about 
the binary encoding of the words, such as 'is the third bit 
of the POS tag encoding equal to one?', and hence can 
ask about which partition a word is in. 
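
The implicit binary encoding and the bit questions can be
sketched as below. The tree is given by hand rather than grown
by clustering, and the small tag hierarchy, the nested-pair
representation, and the function names are assumptions made for
illustration; they do not reproduce the tree in Figure 1.

```python
def assign_codes(tree, prefix=""):
    """Map each leaf of a nested-pair binary tree to its bit string.

    A leaf is a string (a POS tag); an internal node is a pair
    (left, right). Going left appends '0', going right appends '1'.
    """
    if isinstance(tree, str):
        return {tree: prefix}
    left, right = tree
    codes = assign_codes(left, prefix + "0")
    codes.update(assign_codes(right, prefix + "1"))
    return codes

def bit_is_one(code, position):
    """Elementary question: is bit `position` (1-based) equal to one?
    Codes shorter than `position` answer False."""
    return len(code) >= position and code[position - 1] == "1"

# Hypothetical small hierarchy; a real tree is grown by clustering.
toy_tree = (("NN", "NNS"), ("VB", ("VBD", "VBZ")))
codes = assign_codes(toy_tree)
print(codes)
print(bit_is_one(codes["VBZ"], 3))
```

Each question about a bit corresponds to choosing one side of
one internal node, so a sequence of bit questions narrows the
context to a partition of the tags.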

Unlike other work that uses classification trees as
the basis for the questions used by a decision tree
(e.g. (Black et al., 1992)), we treat the word identities
as a further refinement of the POS tags. This approach
has the advantage of avoiding unnecessary data fragmen- 
tation, since the POS tags and word identities will not be 
viewed as separate sources of information. We grow the 
classification tree by starting with a unique class for each 
word and each POS tag that it takes on. When we merge 
classes to form the hierarchy, we only allow merges if all 
of the words in both classes have the same POS tag. The 
result is a word classification tree for each POS tag. This 
approach to growing the word trees simplifies the task, 
since we can take advantage of the hand-coded linguistic 
knowledge (as represented by the POS tags). Further- 
more, we can better deal with words that can take on 
multiple senses, such as the word "loads", which can be 
a plural noun (NNS) or a present tense third-person verb 

(PRP).²

² Word-POS combinations that occur only once in the training
corpus are grouped together in the class <unknown>,
which is unique for each POS tag.

Figure 1: POS Classification Tree

Figure 2: A Word Classification Tree

In Figure 2, we give the classification tree for the personal
pronouns (PRP). It is interesting to note that the
clustering algorithm distinguished between the subjective
pronouns 'I', 'we', and 'they', and the objective pronouns
'me', 'us', and 'them'. The pronouns 'you' and
'it' can take either case, and the algorithm partitioned
them according to their most common usage in the training
corpus. Although distinct POS tags could have been
added to distinguish between these two cases, it seems
that the clustering algorithm can make up for some of
the shortcomings of the tagset.³

³ The words included in the <unknown> class are the reflexive
pronouns 'themselves' and 'itself', which each occurred
once in the training corpus.

3.2 Composite Questions

In the previous section, we discussed the elementary
questions that can be asked of the words and POS tags
in the context. However, there might be a relevant
partitioning of the data that cannot be expressed in that form.
For instance, a good partitioning of a node might involve
asking whether questions $q_1$ and $q_2$ are both true. Using
elementary questions, the decision tree would need
to first ask question $q_1$ and then ask $q_2$ in the true
subnode created by $q_1$. This means that the false case has
been split into two separate nodes, which could cause
unnecessary data fragmentation.

Unnecessary data fragmentation can be avoided by
allowing composite questions. Bahl et al. (Bahl et al.,
1989) introduced a simple but effective approach for
constructing composite questions. Rather than allowing any
boolean combination of elementary questions, they
restrict the typology of the combinations to pylons, which
have the following form (true maps all data into the true
subset).

$$
\begin{aligned}
\text{pylon} &\Rightarrow \text{true} \\
\text{pylon} &\Rightarrow (\text{pylon} \wedge \text{elementary}) \\
\text{pylon} &\Rightarrow (\text{pylon} \vee \text{elementary})
\end{aligned}
$$

The effect of any binary question is to divide the data
into true and false subsets. The advantage of pylons is
that each successive elementary question has the effect
of swapping data from the true subnode into the false or
vice versa. Hence, one can compute the change in node
impurity that results from each successive elementary
question that is added. This allows one to use a greedy
algorithm to build the pylon by successively choosing the
elementary question that results in the largest decrease in
node impurity.

We actually employ a beam search and explore the 
best 10 alternatives at each level of the pylon. Again we 
make use of the heldout data to help pick the best pylon, 
but we must be careful not to make too much use of it for 
otherwise it will become as biased as the training data. 
If the last question added to a candidate pylon results in 
an increase in node impurity with respect to the heldout 
data, we remove that question and stop growing that al- 
ternative. When there are no further candidates that can 
be grown, we choose the winning pylon as the one with 
the best decrease in node impurity with respect to the 
training data. The effect of using composite questions is 
explored in Section 4.3. 
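
A greedy pylon builder can be sketched as follows. This is a
simplified illustration, not the procedure used in the
experiments: it assumes entropy as the impurity measure, uses
the training data only, recomputes impurity from scratch rather
than incrementally, and omits both the beam search and the
heldout check. All names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(events):
    """Entropy in bits of the empirical distribution over `events`."""
    total = len(events)
    if total == 0:
        return 0.0
    counts = Counter(events)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def impurity(data, mask):
    """Size-weighted entropy of the true/false subsets selected by `mask`."""
    yes = [event for (_, event), m in zip(data, mask) if m]
    no = [event for (_, event), m in zip(data, mask) if not m]
    return (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(data)

def grow_pylon(data, questions):
    """Greedily extend a pylon, starting from `true`.

    Returns the list of (operator, question name) steps and the final mask.
    """
    mask = [True] * len(data)  # `true` maps all data into the true subset
    steps = []
    current = impurity(data, mask)
    while True:
        best = None
        for name, q in questions.items():
            answers = [q(context) for context, _ in data]
            for op in ("and", "or"):
                if op == "and":
                    new_mask = [m and a for m, a in zip(mask, answers)]
                else:
                    new_mask = [m or a for m, a in zip(mask, answers)]
                score = impurity(data, new_mask)
                if best is None or score < best[0]:
                    best = (score, op, name, new_mask)
        # Stop when no extension decreases the node impurity.
        if best is None or best[0] >= current:
            return steps, mask
        current, mask = best[0], best[3]
        steps.append((best[1], best[2]))

# Toy node whose events depend on q1 AND q2 together.
data = [({"a": a, "b": b}, "X" if a and b else "Y")
        for a in (True, False) for b in (True, False)]
questions = {"q1": lambda c: c["a"], "q2": lambda c: c["b"]}
steps, mask = grow_pylon(data, questions)
print(steps)
```

On this toy node the pylon isolates the single case where both
questions hold in one true subset, leaving all other data
together in the false subset instead of spreading it over two
nodes.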



4 RESULTS

To demonstrate our model, we have tested it on the
Trains corpus (Heeman and Allen, 1995), a collection of
human-human task-oriented spoken dialogues consisting
of six and a half hours of speech, 34 different speakers,
and 58,000 words of transcribed speech, with a vocabulary
size of 860 words. To make the best use of the
limited amount of data, we use a 6-fold cross validation
procedure, in which we use each sixth of the corpus for
testing data, and the rest for training data.

A way to measure a language model is to compute the
perplexity it assigns to a test corpus, which is an estimate
of how well the language model is able to predict the
next word. The perplexity of a test set of $N$ words $w_{1,N}$
is calculated as follows,

$$
2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 \Pr(w_i \mid w_{1,i-1})}
$$

where $\Pr$ is the probability distribution supplied by the
language model. Full details of how we compute the
word-based perplexity are given in (Heeman, 1997). We
also measure the error rate in assigning the POS tags.
Here, as in measuring the perplexity, we run the language
model on the hand-transcribed word annotations.

4.1 Effect of Richer Context

Table 1 gives the perplexity and POS tagging error rate
(expressed as a percent). To show the effect of the richer
modeling of the context, we vary the amount of context
given to the decision tree. As shown by the perplexity
results, the context used for traditional POS-based
language models (second column) is very impoverished. As
we remove the simplifications to the context, we see the
perplexity and POS tagging error rates improve. By using both
the previous words and previous POS tags as the context,
we achieve a 43% reduction in perplexity and a 5.4%
reduction in the POS error rate.


Context for $W_i$   $P_i$           $P_{i-3,i}$     $P_{i-1,i} W_{i-3,i-1}$   $P_{i-3,i} W_{i-3,i-1}$
Context for $P_i$   $P_{i-1,i-1}$   $P_{i-3,i-1}$   $P_{i-3,i-1}$             $P_{i-3,i-1} W_{i-3,i-1}$
POS Error Rate      3.13            3.10            3.03                      2.97
Perplexity          42.32           32.11           29.49                     24.17

Table 1: Using Richer Contexts
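
The perplexity measure defined at the start of this section, and
the relative reduction between the first and last perplexity
values in Table 1, can be sketched as follows. The function
names and the uniform toy probabilities are illustrative
assumptions; only the two perplexity values are taken from the
table.

```python
import math

def perplexity(word_probs):
    """Perplexity 2^(-(1/N) * sum_i log2 Pr(w_i | w_{1,i-1})).

    `word_probs` holds the model's probability for each test word.
    """
    n = len(word_probs)
    return 2 ** (-sum(math.log2(p) for p in word_probs) / n)

def relative_reduction(before, after):
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before

# A model that gives every test word probability 1/4 has perplexity 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# Perplexities of the most and least simplified contexts in Table 1.
print(round(100 * relative_reduction(42.32, 24.17)))
```

The second value rounds to 43, matching the perplexity reduction
quoted in the text.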



4.2 Constraining the Decision Tree 

As we mentioned earlier, the word identity information
$W_{i-j}$ is viewed as further refining the POS tag of the
word $P_{i-j}$. Hence, questions about the word encoding
are only allowed if the POS tag is uniquely defined.
Furthermore, for both POS and word questions, we restrict
the algorithm so that it asks about more specific bits
of the POS tag and word encodings only if it has already
uniquely identified the less specific bits. In Table 2, we
contrast the effectiveness of adding further constraints.
The second column gives the results of adding no further
constraints, the third column allows questions about
a POS tag $P_{i-j-1}$ only if $P_{i-j}$ is uniquely determined,
and the fourth column adds the constraint that the word
$W_{i-j}$ must also be uniquely identified before questions
are allowed of $P_{i-j-1}$.

From the table, we see that it is worthwhile to force the 
decision tree to fully explore a POS tag for a word in the 
context before asking about previous words. Hence, we 
see that the decision tree algorithm needs help in learn- 
ing that it is better to fully explore the POS tags. How- 
ever, we see that adding the further constraint that the 
word identity should also be fully explored results in a 
decrease in performance of the model. Hence, we see 
that it is not worthwhile for the decision tree to fully ex- 
plore the word information (which is the basis of class- 
based approaches to language modeling), and it is able to 
learn this on its own. 
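
The POS constraint (third column) can be sketched as an
eligibility check on candidate questions. The representation is
an assumption made for illustration: `prefixes` maps a context
position (1 for the previous word) to the POS bits already
determined on the path to the current node, and `codes` maps
each tag to its full binary encoding. Neither structure nor the
function names come from the paper.

```python
def pos_determined(prefix, codes):
    """True if the known bit prefix identifies exactly one POS tag."""
    return prefix in codes.values()

def may_ask_pos_bit(position, bit, prefixes, codes):
    """Can the tree ask about bit `bit` (1-based) of the POS tag at
    context position `position` (1 = previous word)?"""
    known = prefixes.get(position, "")
    # The tag is already fully identified, so no bits remain to ask about.
    if pos_determined(known, codes):
        return False
    # All less specific bits must already be known.
    if len(known) != bit - 1:
        return False
    # The POS tag of the nearer word must be uniquely determined first.
    if position > 1 and not pos_determined(prefixes.get(position - 1, ""), codes):
        return False
    return True

def may_ask_word_bit(position, prefixes, codes):
    """Word questions are allowed only once the POS tag is determined."""
    return pos_determined(prefixes.get(position, ""), codes)

# Hypothetical three-tag encoding.
codes = {"NN": "00", "NNS": "01", "VB": "1"}

partial = {1: "0"}    # the previous tag is either NN or NNS
print(may_ask_pos_bit(1, 2, partial, codes))
print(may_ask_pos_bit(2, 1, partial, codes))

resolved = {1: "00"}  # the previous tag is known to be NN
print(may_ask_pos_bit(2, 1, resolved, codes))
print(may_ask_word_bit(1, resolved, codes))
```

The stricter configuration in the fourth column would add an
analogous test that the word encoding at the nearer position is
also fully determined.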

                 None    POS     Full
POS Error Rate   3.19    2.97    3.00
Perplexity       25.64   24.17   24.39

Table 2: Adding Additional Constraints

4.3 Effect of Composites

The next area we explore is the benefit of composite
questions in estimating the probability distributions. The
second column of Table 3 gives the results if composite
questions are not employed, the third column gives
the results if composite questions are employed, and the
fourth gives the results if we employ a beam search in
finding the best pylon (with up to 10 alternatives). From
the results, we see that the use of pylons reduces the word
perplexity by 4.7%, and the POS error rate by 2.3%.
Furthermore, we see that using a beam search, rather than
an entirely greedy algorithm, accounts for some of the
improvement.





                 Not Used   Single   10
POS Error Rate   3.04       3.04     2.97
Perplexity       25.36      24.36    24.17

Table 3: Effect of Composite Questions
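
The reductions quoted for pylons can be checked directly from
the entries of Table 3. The arithmetic below is a worked check
derived from the table, with the greedy-only figure added for
comparison.

```latex
\frac{25.36 - 24.17}{25.36} \approx 0.047
  \quad \text{(perplexity, no composites vs. beam of 10)}
\\
\frac{3.04 - 2.97}{3.04} \approx 0.023
  \quad \text{(POS error rate, no composites vs. beam of 10)}
\\
\frac{25.36 - 24.36}{25.36} \approx 0.039
  \quad \text{(perplexity, no composites vs. single greedy pylon)}
```

The gap between the first and third lines is the part of the
perplexity improvement attributable to the beam search.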



4.4 Effect of Larger Context

In Table 4, we look at the effect of the size of the
context, and compare the results to a word-based backoff
language model (Katz, 1987) built using the CMU toolkit
(Rosenfeld, 1995). For a bigram model, the word-based
model has a perplexity of 29.3, in comparison to our word
perplexity of 27.4. For a trigram model, the word-based
model has a perplexity of 26.1, in comparison to our
perplexity of 24.2. Hence we see that our POS-based model
results in a 7.2% improvement in perplexity.

                              Bigram   Trigram   4-gram
POS Error Rate                3.19     2.97      2.97
Perplexity                    27.37    24.26     24.17
Word-based Model Perplexity   29.30    26.13

Table 4: Using Larger Contexts

5 CONCLUSION

In this paper, we presented a new way of incorporating
POS information into a language model. Rather than
treating POS tags as intermediate objects solely for
recognizing the words, we redefine the speech recognition
problem so that its goal is to find the best word sequence
and their best POS assignment. This approach allows
us to use the POS tags as part of the context for
estimating the probability distributions. In fact, we view the
word identities in the context as a refinement of the POS
tags rather than viewing the POS tags and word identities
as two separate sources of information. To deal with
this rich context, we make use of decision trees, which
can use information theoretic measures to automatically
determine how to partition the contexts into equivalence
classes. We find that this model results in a 7.2%
reduction in perplexity over a trigram word-based model
for the Trains corpus of spontaneous speech. Currently,
we are exploring the effect of this model in reducing the
word error rate.

Incorporating shallow syntactic information into the
speech recognition process is just the first step. In other
work (Heeman, 1997; Heeman and Allen, 1997), this
syntactic information, as well as the techniques
introduced in this paper, are used to help model the
occurrence of dysfluencies and intonational phrasing in a
speech recognition language model. Our use of decision
trees to estimate the probability distributions proves
effective in dealing with the richer context provided by
modeling these spontaneous speech events. Modeling
these events improves the perplexity to 22.5, a 14%
improvement over the word-based trigram backoff model,
and reduces the POS error rate by 9%.